feature attn with state #14299

Closed · Zijie-Tian wants to merge 90 commits into ggml-org:master from Zijie-Tian:tzj/feature-attn-with-state
Conversation
- Introduced `run-prefill-decode-bench.sh` for running prefill-decode benchmarks with customizable parameters, added `extract_bench_results.py` to extract structured data from benchmark markdown files into CSV, and updated `.gitignore` to exclude the generated `bench_results` directory.
- Introduced `analyze_benchmark_results.py` for processing benchmark CSV files and generating performance pivot tables, updated `run-prefill-decode-bench.sh` to support multiple KV cache types plus options for prompt length and forced alignment, and broadened the markdown file matching in `extract_bench_results.py`.
- Introduced `run_op_bench.sh` to run Flash Attention benchmarks with configurable head sizes, KV lengths, and quantization types, added `summary_flash_attn.py` to extract performance metrics and generate analysis summaries, and extended the test cases in `test-backend-ops.cpp` with additional KV lengths and quantization types for broader performance coverage.
- Introduced a profiling feature for the ggml library that tracks per-operation timings within computation graphs: added `ggml-profile.h` and `ggml-profile.cpp` defining the profiling structures and functions, added a `CMakeLists.txt` option to enable the graph profiler, integrated profiling calls into graph computation, and added profiling presets to `CMakePresets.json` (a minimal timing sketch follows below).
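The per-node timing idea can be illustrated with a minimal sketch. This uses only public ggml calls (`ggml_time_us`, `ggml_op_name`); the hook shape and the `node_timing` struct are assumptions, not the actual `ggml-profile.h` API from this PR:

```cpp
// Minimal sketch of per-node graph timing, assuming a hypothetical hook
// invoked around each node's computation; the real ggml-profile.h API
// introduced by this PR may be structured differently.
#include "ggml.h"

struct node_timing {
    const char * op_name; // name of the node's operation
    int64_t      t_us;    // accumulated wall-clock time in microseconds
};

// Illustrative wrapper around the computation of a single graph node.
static void profile_node(struct ggml_tensor * node, struct node_timing * acc) {
    const int64_t t0 = ggml_time_us();
    // ... the backend computes `node` here ...
    acc->op_name = ggml_op_name(node->op);
    acc->t_us   += ggml_time_us() - t0;
}
```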
- Added a function to enable or disable GGML graph profiling for a given output path, updated `test_gen` to enable profiling only during the last generation iteration, and ensured profiling is reset after each benchmark run in `main`.
- Switched the output format in `ggml-profile.cpp` to CSV for easier parsing, introduced a global variable in `llama-bench.cpp` to manage the `GGML_GRAPH_PROFILE` setting dynamically, and added a getter for the current value of the `GGML_GRAPH_PROFILE` environment variable (see the sketch below).
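Such a getter is plain standard C++; the helper name below is an assumption, not necessarily what `llama-bench.cpp` uses:

```cpp
// Hedged sketch: read the GGML_GRAPH_PROFILE environment variable.
// An empty result means profiling is disabled.
#include <cstdlib>
#include <string>

static std::string get_ggml_graph_profile() {
    const char * v = std::getenv("GGML_GRAPH_PROFILE");
    return v ? std::string(v) : std::string();
}
```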
- Introduced `run-breakdown.sh` for operator-breakdown profiling with configurable model path, thread count, output directory, and prefill depths, updated `.gitignore` to exclude the breakdown result files, and extended `llama-bench.cpp` to profile both prefill and decode operations.
- Introduced `analyze_breakdown.py` to parse the breakdown CSV files, analyze operator performance, and generate bar- and pie-chart visualizations, with a command-line interface that processes multiple CSV files or a single file and can produce comparison charts across depths.
- Introduced a `SKIP_ANALYSIS` flag that lets users skip the data-analysis step during profiling, documented the flag and its default in the help text, added a check that warns when Python dependencies are missing, and adjusted the output display accordingly.
- Added T-MAC quantization types and configurations to the ggml library: extended `convert_hf_to_gguf.py` with T-MAC options and quantization configurations, updated the CMake files with T-MAC compilation options and source files, introduced T-MAC utility functions in the gguf Python module, adapted the existing quantization logic to the new formats, and updated model loading and tensor operations to use the T-MAC optimizations.
- Added T-MAC quantization types and validation in `ggml.h` and `ggml-quants.c`, updated type traits and tensor-size calculations in `ggml.c`, made the CMake configuration include the T-MAC source files conditionally on compilation flags, and adapted the llama model loader and quantization logic to handle the new types across components.
- Adjusted the T-MAC type count in `ggml.h` to reflect the correct number of types under each compilation flag, and cleaned up `CMakeLists.txt` so the T-MAC definitions and include directories are included properly (type metadata can be inspected as sketched below).
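Any type registered in ggml's type-traits table, including new ones like the T-MAC types this PR adds, can be inspected through the public accessor. The helper below is an illustrative sketch, not code from the PR; the T-MAC type names and block layouts are not shown here:

```cpp
// Hedged sketch: report the block geometry of a registered ggml type via
// the public type-traits accessor in ggml.h. A T-MAC type added by this
// PR would report its own block size and bytes-per-block here.
#include "ggml.h"
#include <cstdio>

static void print_type_info(enum ggml_type type) {
    const ggml_type_traits * tt = ggml_get_type_traits(type);
    printf("%s: %lld elements/block, %zu bytes/block, quantized: %d\n",
           tt->type_name, (long long) tt->blck_size, tt->type_size,
           (int) tt->is_quantized);
}
```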
- Introduced `test-quantize-accuracy.cpp` to evaluate the accuracy of the quantization and dequantization round trip, and registered the new test in `CMakeLists.txt` (a minimal version of such a check is sketched below).
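A minimal round-trip accuracy check in the spirit of such a test might look like the following; this is a hedged sketch using only public ggml calls (`ggml_quantize_chunk`, `ggml_row_size`, `ggml_get_type_traits`), and the actual `test-quantize-accuracy.cpp` in this PR may be structured differently:

```cpp
// Hedged sketch of a quantize/dequantize round-trip accuracy check.
#include "ggml.h"
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int64_t n = 4096; // must be a multiple of the block size
    std::vector<float> src(n), out(n);
    for (int64_t i = 0; i < n; ++i) {
        src[i] = std::cos((float) i); // deterministic synthetic input
    }

    const enum ggml_type type = GGML_TYPE_Q4_0;
    std::vector<uint8_t> q(ggml_row_size(type, n)); // quantized buffer

    // quantize one row of n elements, then dequantize it back to floats
    ggml_quantize_chunk(type, src.data(), q.data(), 0, 1, n, nullptr);
    ggml_get_type_traits(type)->to_float(q.data(), out.data(), n);

    double se = 0.0;
    for (int64_t i = 0; i < n; ++i) {
        se += (out[i] - src[i]) * (out[i] - src[i]);
    }
    printf("%s RMSE: %f\n", ggml_type_name(type), std::sqrt(se / n));
    return 0;
}
```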
- Introduced a CMake option to enable QlutAttn, updated `CMakeLists.txt` to conditionally compile the QlutAttn definitions and include directories, and wired the functionality into the `ggml-base` target.
- Added the additional T-MAC quantization types to `kv_cache_types` in `arg.cpp`, updated `ggml.h` to report the T-MAC type count without conditional compilation, and extended `llama-graph.cpp` so the attention mechanism supports the new T-MAC types.
- Introduced a new example, `flash-attn-inspector`, demonstrating flash attention in LLaMA models: added the CMake configuration to build it, implemented the main functionality in `flash-attn-inspector.cpp` with tensor-data handling and debug logging, and added a test target exercising callback functionality during inference (an eval-callback sketch follows below).
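Inspecting tensors during inference can be done with the scheduler eval callback that llama.cpp already exposes (`cb_eval` in `llama_context_params`). The sketch below shows the general pattern; it is not the actual `flash-attn-inspector.cpp` code:

```cpp
// Hedged sketch of a graph-eval callback that logs flash-attention nodes;
// the real flash-attn-inspector example may differ.
#include "llama.h"
#include "ggml.h"
#include <cstdio>

static bool inspect_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    (void) user_data;
    if (ask) {
        // request data only for flash-attention nodes
        return t->op == GGML_OP_FLASH_ATTN_EXT;
    }
    fprintf(stderr, "%s: %s shape [%lld, %lld, %lld, %lld]\n",
            ggml_op_name(t->op), t->name,
            (long long) t->ne[0], (long long) t->ne[1],
            (long long) t->ne[2], (long long) t->ne[3]);
    return true; // continue graph execution
}

// Attach before creating the context:
//   llama_context_params params = llama_context_default_params();
//   params.cb_eval           = inspect_cb;
//   params.cb_eval_user_data = nullptr;
```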
- Added the `breakdown_results` and `breakdown_results_llamacpp` directories to `.gitignore` so files generated by breakdown profiling are excluded from version control.
- Updated `.gitignore` to include the `breakdown_results_llamacpp/` directory, added `ggml_structure.mdc` and `project_structure.mdc` documenting the project and its components, introduced `python_scripts.mdc` outlining the project's Python scripts, added new tests `test-flash-attn.cpp` and `test-mul-mat.cpp` validating flash attention and matrix multiplication, and registered the new test targets in `CMakeLists.txt`.
- ggml-ci
- ggml-ci
- Introduced `ggml_cpu_structure.mdc` detailing the CPU-specific implementation of the GGML tensor library, including core source files, operation implementations, and architecture-specific optimizations, and updated `ggml_structure.mdc` to reference the new CPU backend documentation.
- …ith quantized tensor support and improve computation graph handling
- …flash_attn_ext_mixed-function: Fix mask indexing in mixed flash attention and correct Q initialization
- …efill-test-and-align_kv-mixed.sh: Fix causal mask padding in flash decoding test
- …ng and adding layer-wise K/V quantization operations; improved logging for debugging and computation-graph handling
- …0-quantization-007b: Modify custom op for Q4_0 quantization
- …ard-flash-attn-ext-f16-function-7321: Modify ggml_compute_forward_flash_attn_ext_f16 function
- Implemented a function that converts ggml tensors to torch tensors using type traits, covering a range of tensor types; enhanced the dequantization path to use the type traits' float conversion and added error handling for unsupported types, improving integration with PyTorch and easing tensor management (the dequantize-to-float step is sketched below).
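The dequantize-to-float step underlying such a conversion can be sketched as follows. This is a hedged illustration using the public type-traits lookup from `ggml.h`; the PR's actual conversion function and its torch binding are not reproduced here, and the helper name is hypothetical:

```cpp
// Hedged sketch: dequantize/convert a contiguous ggml tensor to floats via
// the type-traits to_float routine, as a precursor to building a torch
// tensor from the resulting buffer.
#include "ggml.h"
#include <cstring>
#include <stdexcept>
#include <vector>

static std::vector<float> tensor_to_float(const struct ggml_tensor * t) {
    const int64_t n = ggml_nelements(t);
    std::vector<float> out(n);

    if (t->type == GGML_TYPE_F32) {
        std::memcpy(out.data(), t->data, n * sizeof(float)); // already float
        return out;
    }
    const ggml_type_traits * tt = ggml_get_type_traits(t->type);
    if (tt->to_float == nullptr) {
        throw std::runtime_error("unsupported tensor type"); // no float conversion registered
    }
    // assumes a contiguous tensor; converts all elements at once
    tt->to_float(t->data, out.data(), n);
    return out;
}
```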
- …sh-attn-ext-f16-febc: Fixed ggml_compute_forward_flash_attn_ext_f16_with_state